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Abstract 

Cancer is a heterogeneous disease with different combinations of genetic and epigenetic alterations 
driving the development of cancer in different individuals. While these alterations are believed to con¬ 
verge on genes in key cellular signaling and regulatory pathways, our knowledge of these pathways 
remains incomplete, making it difficult to identify driver alterations by their recurrence across genes or 
known pathways. We introduce Combinations of Mutually Exclusive Alterations (CoMEt), an algorithm 
to identify combinations of alterations de novo, without any prior biological knowledge (e.g. pathways 
or protein interactions). CoMEt searches for combinations of mutations that exhibit mutual exclusivity, 
a pattern expected for mutations in pathways. 

CoMEt has several important feature that distinguish it from existing approaches to analyze mutual 
exclusivity among alterations. These include: an exact statistical test for mutual exclusivity that is more 
sensitive in detecting combinations containing rare alterations; simultaneous identification of collections 
of one or more combinations of mutually exclusive alterations; simultaneous analysis of subtype-specific 
mutations; and summarization over an ensemble of collections of mutually exclusive alterations. These 
features enable CoMEt to robustly identify alterations affecting multiple pathways, or hallmarks of can¬ 
cer. We show that CoMEt outperforms existing approaches on simulated and real data. Application 
of CoMEt to hundreds of samples from 4 different cancer types from TCGA reveals multiple mutually 
exclusive sets within each cancer type. Many of these overlap known pathways, but others reveal novel 
putative cancer genes. 


1 Introduction 

A major goal of large-scale cancer genomics projects such as The Cancer Genome Atlas (TCGA) l[T]^, the 
International Cancer Genome Consortium (ICGC) and others is to identify the genetic and epigenetic 
alterations that drive cancer development. These projects have generated whole-genome/exome sequencing 
data measuring the somatic mutations in thousands of tumors in dozens of cancer types. Interpreting this 
data requires one to distinguish the driver mutations that play a role in cancer development and progres¬ 
sion from passenger mutations that have no consequence for cancer. Identifying driver mutations directly 
from sequencing data is a significant challenge since individuals with the same cancer type typically exhibit 
different combinations of driver mutations ||^[^. This mutational heterogeneity arises because driver mu¬ 
tations target genes in pathways - collections of interacting genes that perform a biological function (e.g. 
signaling or regulation) - such that each pathway can be perturbed in numerous ways pT[ . 

The observed mutational heterogeneity in cancer has motivated the development of methods to exam¬ 
ine combinations of mutations, including methods that examine known pathways or networks (reviewed in 
Gun))- ] dowever, most pathway databases and interaction networks are incomplete, lack tissue-specificity, 
and do not accurately represent the biology of a particular cancer cell. Thus, de novo methods for examining 
combinations of mutations are of particular interest as they require no prior biological knowledge and enable 
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the discovery of novel combinations. Unfortunately, the number of possible combinations is too large to test 
exhaustively and achieve statistically significant results. Current de novo approaches to identify putative 
combinations of mutations use the observation that mutations in the same pathway are often mutually ex¬ 
clusive 0 - This observation follows from the observation that there are relatively few driver mutations in 
a tumor sample, and these are distributed over multiple pathways/hallmarks of cancer 1151. 

In 2011, three algorithms for identifying sets of genes with mutually exclusive mutations were intro¬ 


duced simultaneously: the De Novo Driver Exclusivity (Dendrix) 116|, Recurrent Mutually Exclusive aber¬ 
rations (RME) p7| , and Mutual Exclusivity Modules (MEMo) | [T8| algorithms. Dendrix and RME are both 
de novo algorithms for identifying gene sets with mutually exclusive mutations, while MEMo examines 
mutual exclusivity on a protein-protein interaction network. The Dendrix algorithm identifies sets M of 
k genes with high coverage (many samples have a mutation in the set) and approximate exclusivity (few 
samples have a mutation in more than one gene in the set). Dendrix combines these two criteria into a 
weight W{M), which is equal to the coverage of M minus the coverage overlap (co-occurring mutations) 


of M. Einding the set of maximum weight is an NP-hard problem 1161. Dendrix uses a Markov chain Monte 
Carlo (MCMC) algorithm to sample high weight gene sets; more recently other optimization methods have 
been used to find high weight sets Eeiserson et al. |211 introduced the Multi-Dendrix algorithm to 

identify multiple mutually exclusive gene sets simultaneously using an integer linear program. In contrast, 
RME defines the exclusivity weight as the percentage of covered samples that contain exactly one muta¬ 
tion within a gene set, and uses an online-learning linear threshold algorithm to identify groups of genes 
with high pairwise exclusivity. However, both the RME and MEMo algorithms were shown not to scale to 
reasonable-sized datasets |2T|, requiring extensive filtering of input data p7|[22]|. 


(a) 

Dendrix m, 

W(M)= 144 m, 
CoMEt m, 
^(M)= 1.1x10-^ m, 
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Eigure 1: (a) Alteration matrices illustrating differences between the combinatorial weight function 
W (M) introduced in Dendrix and the probabilistic score T'(M) used in CoMEt. Both matrices contain 
4 mutually exclusive alterations whose alteration frequencies are indicated inside each bar. The samples 
without alterations are not shown in either matrix. Since both sets are exclusive and have the same total 
alteration frequency, the Dendrix weight function does not distinguish between these sets. Sets like M 
(left) are common in cancer genome studies which often have a small number of recurrently mutated genes 
and a long tail of rarely mutated genes. The score used in CoMEt conditions on the observed frequencies 
of each alteration, giving more significance to the set M'. (b) An example of 2 x 2 x 2 contingency 
table Xm for the set M = {mi, m2, m3}, illustrating how samples are cross-classified into exclusive, 
co-occurring, or absent for each alteration. The test statistic <?i(M) used by CoMEt is the sum of the 
highlighted exclusive cells. 

One limitation of the combinatorial weight function used in Dendrix and subsequent algorithms is that 
genes with high mutation frequencies (high coverage) can dominate the mutual exclusivity signal, thus 
biasing the algorithms towards identifying gene sets where the majority of the coverage comes from one gene 
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(Figure [TJa)). These observations motivated the development of probabilistic models of mutual exclusivity. 
These include the Dendrix++ algorithm (an early version of the approach that we present in this paper) and 
the muex algorithm | [2^ . Dendrix++ uses a statistical score and was used in TCGA acute myeloid leukemia 
study Q. The muex algorithm | |2^ uses a generative model of mutual exclusivity and a likelihood ratio test 
to identify mutually exclusive sets. We find that muex remains sensitive to the presence of high frequency 
mutations (See Section 3.2.3 1 . Moreover, both of these approaches exhaustively enumerate gene sets to find 
fhose wifh high score, limifing fheir applicabilify fo larger dafasefs. In addition, fhey do nol idenfify multiple 
gene sefs simulfaneously, a fealure fhaf has proved useful wifh fhe Dendrix weighf | |2T| . Finally, no currenf 
mefhod idenlifies overlapping gene sef^ allhough cancer genes have been shown lo parlicipale in multiple 
pafhways |[T|, or addresses fhe problem of cancer sublype-specific mufalions which can confound fhe mufual 
exclusivity signal. 

We inlroduce fhe Combinations of Mulually Exclusive Alterations (CoMEt) algorithm to address the 
limitations outlined above. CoMEt includes the following contributions. 


1. We develop an exact statistical test for mutual exclusivity conditional on the observed frequency of 
each alteration. This approach is less biased towards high frequency alterations, and enables the dis¬ 
covery of combinations of lower frequency alterations. We derive a novel tail enumeration procedure 
to compute the exact test, as well as a binomial approximation. 

2. CoMEt simultaneously identifies collections consisting of multiple combinations of mutually exclu¬ 
sive alterations, and samples from such collections using an MCMC algorithm. We summarize the 
resulting distribution by computing the marginal probability of pairs of alterations in the same sets. 
This enables CoMEt to identify sets of any size, including overlapping sets of alterations, without 
testing many parameter settings. 

3. Given prior knowledge of cancer-types/sub types, CoMEt analyzes alterations and subtypes simulta¬ 
neously, allowing the discovery of mutually exclusive alterations across cancer types, while avoiding 
the identification of spurious mutually exclusive sets of (sub)type-specfic mutations. 


We demonstrate that CoMEt outperforms earlier approaches on simulated and real cancer data. We apply 
CoMEt to acute myeloid leukemia (AME), glioblastoma (GBM), gastric (STAD), and breast cancer (BRCA) 
data from TCGA. In each cancer type, we identify combinations of mutated genes that overlap known 
cancer pathways and and also contain potentially novel cancer genes including IL7R and the EphB receptor 
EPHB3 in STAD, and the scavenger receptor SRCRB4D in GBM. On the gastric and breast cancer data, 
we demonstrate how CoMEt simultaneously identifies mutual exclusivity resulting from pathways and from 
subtype-specific mutations. CoMEt is available at http : / /compbio . cs . brown . edu/software/ 
comet 


2 Methods 

2.1 Overview of the CoMEt Algorithm 

We consider that a set S of m alterations have been measured in n samples. An alteration may be the 
somatic mutation of a particular gene, a specific single nucleotide mutation (e.g. V600E mutations in the 
BRAF gene), an epigenetic change such as hypermethylation of a promoter, or a variety of other changes. 
We assume that alterations are binary, such that alterations are either present or absent in each sample. 
We represent the set of measured alterations with an m x re binary alteration matrix A = [oij], where 
aij = 1 if alteration i occurs in sample j, and atj = 0 otherwise. Our goal is to identify one or more sets 

*We note that while Multi-Dendrix |2l| allows for searching for overlapping gene sets, this option was never explored. 
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Figure 2: Overview of the CoMEt algorithm. First, we transform alteration data from different measure¬ 
ments into a binary alteration matrix A. Second, for fixed values of k and t we use a Markov Chain Monte 
Carlo (MCMC) algorithm to sample collections M in proportion to the weight d>(M)“^. Here we show a 
collection containing sets M and M' with three and two alterations, respectively. We identify all collections 
whose weight exceeds the maximum observed in randomly permuted datasets. We summarize the alterations 
in these significant collections with a marginal probability graph, whose edge weights indicate the fraction 
of significant collections with the corresponding pair of alterations. 


Ml, M 2 ,..., Mt where the alterations in each Mi are surprisingly mutually exclusive across the n samples. 
We introduce the CoMEt algorithm for this purpose. 

CoMEt uses a novel statistical score based on an exact test for mutual exclusivity. Figure [T] motivates 
the development of the new score, showing two sets M and M', each with four alterations. The alter¬ 
ations in both sets are perfectly exclusive (no sample has more than one alteration), and the total number 
of altered samples is the same. The Dendrix weight function W{M) introduced in |[T^ (and used in later 


publications 119 -211) is defined as the coverage, the number of samples with at least one mutation in M, 
minus the coverage overlap, the number of samples with more than one mutation in M. In this case, 
W{M) = W{M'). However, given the frequencies of each alteration, we are more surprised to observe 
mutual exclusivity among alterations in the set M', which are each altered in 7% of samples, than we are to 
observe mutual exclusivity among the alterations in set M, where a single alteration has very high frequency 
(25%) and three alterations have relatively low frequency (< 2%). Sets like M are common in many cancer 
datasets where highly recurrent alterations (e.g. mutations in TP53 or amplification of EGFR) occur and can 
be combined with low frequency, spurious alterations. This problem was noted in but the probabilistic 
model introduced therein seems to overcorrect for this effect, missing important combinations of alterations 


(See Section [3.2.3| ). 

We derive a score <h(M) for a set M ofk alterations using an exact test of mutual exclusivity. Specifi¬ 
cally, we examine a 2x2x---x2 = 2'= contingency table Xm (Figure [^b)) whose entries indicate the 
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number of samples where each combination of alterations occur. For example, the entry X( 24 ) of equals 
the number of samples where the second and fourth alterations in M occur, but the first and third alterations 
do not occur. The score <1>(M) is the P-value of the observed mutual exclusivity in the table Xm, where the 
margins of the table (determined by the number of samples where each alteration occurs) is fixed. That is 
the score <1>(M) is conditional on the observed frequencies of alterations in M. This statistical score reduces 
the effect of the most frequent alterations have an unduly large contribution to the score. 

CoMEt scores a collection M = (Mi, ..., Mt) of t alteration sets by taking the product of the scores of 
each set Mi\ 

t 

^>(M) = n HMi). (1) 

i=l 

This score follows from the null hypothesis that exclusivity is independent across sets. 

Since the number of possible collections of alteration sets grows exponentially with the number of 
alterations, it is typically impossible to enumerate and compute the weight of all alteration sets. We use 
a Markov chain Monte Carlo (MCMC) algorithm to sample collections M of f alteration sets, where each 
collection is sampled with proportion to its weight <1>(M)“^. We summarize this distribution by computing 
the marginal probability p{e, e') for each pair of alterations in A. We summarize these probabilities using 
the marginal probability graph, a complete, undirected weighted graph G = {V,E) where V = E and 
where each edge e G P connects a pair of vertices n, v with weight p{u, v). We identify the most exclusive 
alteration sets by first removing all edges from the graph weight below a threshold 6. The output of CoMEt 
is C{6), the connected components in the resulting graph, which we call modules. The summarization via 
the marginal probability graph allows CoMEt to output collections of alteration sets different in number and 
size than specified by the input parameters. 


2.2 Scoring mutual exclusivity 

We first describe a statistical score <1>(M) for a tuple M = (mi,..., nik) of alterations. The score measures 
the surprise of the observed exclusivity of these alterations conditional on the rate of occurrence of each 
alteration. Since these rates are generally unknown (e.g. the background mutation rate for single nucleotide 
mutations varies greatly across genes and samples p^), we use the exact distribution obtained from the 
observed data as the null distribution. Under this distribution, the status of the k alterations in n samples 
is described by selecting uniformly a k x m binary alteration matrix B with the constraint that the number 
of I’s in row i of B equals the number of I’s in row m* of the alteration matrix A. This distribution is 
equivalent to the sampling distribution on 2x2x---x2 = 2^ contingency tables under the hypergeometric 
distribution, where dimension i of the table gives the cross-classification of the number of samples where 
alteration i occurs or not. Eor example, three alterations are described by a 2 x 2 x 2 table with margins 
equal to the frequency of each alteration (Eigure[TJb)). 

We introduce notation to describe the statistical test. Given a set M of alterations, let xt., be the number 

U) 

of samples where alteration rrij occurs. It follows that n — is the number of samples where rrij does 
not occur. Similarly, for v C [/c] = {1,..., k}, let Xv denote the number of samples where alterations only 
occur in m^,. The values Xv for all v C [k] give the entries of a 2^ contingency table Xm with fixed margins 
x+ = (x^j,..., Thus, fhe probability of observing a 2^ contingency table Xm with fixed margins 

x'*' and whose sum of entries equals n follows the multivariate hypergeometric distribution 


FXm 


Pr(XM|x''', k, n) 


(n!)^-i rivcifc] 


( 2 ) 
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To characterize the mutual exclusivity of alterations in a contingency table, we define the test statistic 
as the sum of the entries in the contingency table where exactly one alteration occurs, i.e. T{X.m) = 
^{j}’ where x^jj is the number of samples where alterations occur only in rrij. We compute a P- 
value for the observed value T{'Km) of the test statistic as the tail probability of observing tables with the 
same margins whose exclusivity is at least as large as observed: 

Pr(r>r(XM)|x+,A:,n)= Pr(Y|x+, A:, n), (3) 

Yer(x+): 

T(Y)>T{Xm) 


where T(x+) is the set of 2^ contingency tables with margins x+. Note that for k = 2, the test 
statistic T{'X.m) is equivalent to a one-sided Fisher’s exact test. 2x2 contingency tables have only one 
degree of freedom, and thus there are essentially only two ways in which the corresponding pair of random 
variables can be non-independent: having too many co-occurrences or too much exclusivity (Figure [TJb)). 
However, 2^ tables have 2*^ — A: — 1 degrees of freedom and there are many ways in which the corresponding 
random variables can be non-independent. The T(Xm) test statistic measures whether the alterations are 
surprisingly mutually exclusive, rather than non-independent in some other way. 

We define fhe score <1>(M) using fhe mid P-value p5j , which is fhe fhe average of fhe probabilify 
of observing a value af leasf as exfreme as fhe observed value and observing a value more exfreme fhan 
observed: 

$(M) = ^{Pr{T > r(XM)|x+,/c,n) + Pr{T > T{Xm)\^+, k,n)). (4) 


We use fhe mid P-value because fhe fail probabilify from exacf fesfs is fypically overly conservafive, due 
fo fhe discrefeness of fhe exacf disfribufion p5| . Finally, since cancer is driven by mufafions in multiple 
pafhways 115], we define a score <1>(M) for a collection M = (Mi, M 2 ,..., Mt) of t gene sefs as <1>(M) = 
0^=1 The producf resulfs from our assumpfion fhaf under fhe null hypofhesis mufafions in differenf 

sefs Mi are independenf. 


2.3 Computing the mutual exclusivity score <I>(M) 

To compufe fhe mufual exclusivify score 4>(M), one musf compufe Q. This requires computing fhe prob¬ 
abilify of all fables Y wifh fhe same margins as Xm and wifh exclusivify sfafisfic r(Y) af leasf as large 
as fhe observed value T(Xm)- Unforlunalely, no algorifhm is known fo enumerafe such fables. In general 


fhe problem of counfing confingency fables wifh fixed margins are #P-complefe |261, and fhus if is unlikely 
fhey can be enumerafed efficienfly. Several mefhods have been proposed fo solve fhe problem of counfing 
confingency fables, including using fhe network algorithm p7||2^ for Fisher’s exact test in r x c contin¬ 
gency tables, or extensions to consider joint effect of two contingency tables (i.e. 2 x r x c) |29|. Branch 
and bound heuristics have also been used in some specialized cases | [30| . However, these approaches still 
consider at most three dimensional contingency tables, and the problem of enumerating 2^ tables does not 
seem to have been considered. Even for small k the enumeration problem is intractable: the number of 2^ 


tables with fixed margins grows exponenfially in A:. |311 presenfed an exhausfive algorifhms fo enumerafe 
all 2^ and 2“^ confingency fables wifh fixed margins, demonsfrafing for example fhaf for n = 36, fhere are 
>100 million 2^ fables. Randomized and approximafe counfing mefhods for confingency fables have been 
developed (e.g. | [32p3| and references fherein), alfhough fhese generally do nof provide a rigorous guaranfee 
on fhe error in fhe approximafion. 

We derive a novel tail enumeration algorifhm fo efficienfly compufe fhe fail probabilify in equafion Q 
for fables wifh high values of fhe exclusivify sfafisfic T. The mofivafion for our approach is fhaf fhe sefs 
M of inferesf will have exfremely high values of T{X.m), near fhe maximum possible value. For example, 
in fhe degenerate case of perfecl exclusivify (no sample wifh more fhan one alferafion in M) fhere are no 
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more extreme tables to enumerate, and the algorithm needs only to evaluate the hypergeometric probability 
of equation @ for this single table. Thus, if we enumerate tables starting from the highest possible values 
for T, we can obtain highly accurate P-values for the most interesting cases. Furthermore, we can stop the 
enumeration procedure when the P-value becomes sufficiently large and use approximations for these larger 
P-values (See below). 

Algorithm[T]is the tail enumeration strategy to enumerate contingency tables in approximate order from 
most to least exclusive. Briefly, let C = (v C [fc] : |v| > 2) be the vector of co-occurring (not exclusive) 
cells. The basic strategy employed by Algorithm [T] is to generate a table Y that is more exclusive than Xjvr 
(i.e. r(Y) > T(X.m) by iterating through the possible values of each cell in C, using the following facts: 

• When all values in C are fixed, the other values in the contingency table are uniquely determined (See 
Procedure CompleteContTbl in Algorithm [T]l. 

• We can set and update exact upper and lower bounds for each cell in C. The values of each cell 
are bounded by two values (lines 10-11 in TailEnumeration): the first is how many more co¬ 
occurrences are allowed in the current table (Trem) before Y is less exclusive than 'Km ; the second 
is given by the constrained marginal {MarRem) for that variable in Km- 

We find that Algorithm [^performs well on real data, evaluating the test statistic T{Km) in a few seconds 
for sets with k < 7 that have a small number of co-occurrences. 


Algorithm 1 Tail enumeration for any k > 1 

Input: 2^ contingency table X. 


Output: Set S of contingency tables at least as exclusive as X: 5 = Y G T{x'^) : T(Y) > T(X). 

1 



2 

N ^2'^ 


3 

C ^SORTED({v C [k] : V > 2}) 

0 Sorted descending vector of co-occurring cells 

4 

yv ^ 0, Vv C [A:] 


5 

Tmax ^ Z]i=l 

0 Sum of alteration frequencies 

6 

TAILENUMERAT10N(Y, C, Tmax “ ^(X)) 


7 

procedure TailEnumeration(Y, C, Trem) 

» Trem' count of allowed co-occurrences remaining 

8 

V ^ Head(C) 


9 

if V 7 ^ Null then 


10 

Mar Rem •(— min {yf'\ 

t> Minimum margin remaining 

11 

for {i <— L,... , m.m{MarRem, 

J})do 

12 

Y' ^ Copy(Y) 


13 

2/v ^ i 

0 Set value of cell v of Y' to i 

14 

TailEnumeration(Y',Tail(C), - V X i) 

15 

else 

t> If all “co-occurring” cells have been set 

16 

5 = 5 U {CompleteContTbl(Y)} 


17 

procedure CompleteContTbl(Y) 

t> Eill in remainder of contingency table x' 

18 

for V C [A:] : V = 1 do 

t> Iterate over exclusive cells 

19 

yv ^ a:+ - y+ 


20 

Y(o,o,...,o) ^ n - y 

0 Eill in cell with no alterations 

21 

return Y 



Binomial approximation. We can approximate the distribution of the exclusivity statistic using the bino¬ 
mial distribution, which is a well-known approximation of the hypergeometric distribution. Under the null 
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hypothesis that alterations occur independently in the samples, let pe = ’^^e probability of an 

exclusive alteration; i.e. a sample contains exactly one alteration in M. Given a set M of alterations M, then 
the probability of observing T(X.m) or more exclusive alterations in n samples is given by the binomial tail 
probability 1 - - PeT~'- 

We find that the binomial provides a good approximation of the exact test P-value for sets M with a 
large number of co-occurring mutations, and consequently a higher P-value (See Figure [S^. Conveniently, 
these are precisely the cases where the tail enumeration algorithm is slow. 


Permutation approximation. Another approximation to the exact test is obtained using a permutation 
test. We sample L tables with fixed margins uniformly from the space of all tables and compute the pro¬ 
portion of such tables whose exclusivity value T exceeds the observed value T^Km)- Of course, sampling 
uniformly from the set of tables with fixed margins is not straightforward. We use an MCMC approach 
as described in | |T^, a lthough we do not fix the number of alterations per sample. Interestingly, while the 
MEMo algorithmic uses a permutation test, the test statistic is the coverage T{M), rather than the ex¬ 
clusivity T{M) used in CoMEt. While these are equivalent when k = 2 (since there is only one degree of 


freedom), they produce different results for k > 2. See further discussion in Section 2.8 


In our implementation, we use the exact test, binomial approximation, or permutation approximation to 
compute <1>(M) according to the following procedure. Eirst, we calculate the P-value from the binomial 
approximation and compute the number of co-occurring alterations in M. If the number of co-occurring 
alterations is higher than a fixed threshold k or the binomial P-value is larger than a fixed value V’, we 
set <h(M) to be the binomial P-value. Otherwise, we perform the tail enumeration procedure to compute 
the exact test P-value, stopping the enumeration if the accumulated tail probability becomes larger than a 
threshold e. If we stop, then we compute the permutation approximation with samples, such that we 
expect to sample at least one table with T > T{'Km)- This procedure focuses the time to perform tail 
enumeration in those cases where high accuracy is needed for small P-values. 


2.4 Sampling collections of mutually exclusive alterations with MCMC 

Our goal is to identify a collection M of f alteration sets with low (highly significant) values of d>(M). 
Since is typically not possible to enumerate all such collections (except for test datasets with small m, n, 
t, and k), we derive a Markov Chain Monte Carlo (MCMC) approach to sample from the space of possible 
collections. We use the Metropolis-Hastings algorithm p4l[^ to derive a Markov chain Monte Carlo 
(MCMC) algorithm to sample collections M in proportion to the weight <I>(M)“^ (See Supplementary 
Section |Sl.l| ). 


2.5 Marginal probability graph 


We now present a method to extract a collection of highly exclusive alteration sets {with no prescribed size) 
from the posterior distribution obtained from the MCMC algorithm. Typically, there are multiple collections 
with significant scores. This might be for interesting reasons such as different sets of alterations with similar 
scores or alterations that appear in multiple mutually exclusive sets. However, the reason might also be 
suboptimal parameter selection; e.g. there may be a significant set of A: = 3 alterations in the data, but 
running the algorithm with k = 4 will return many sets with the same three genes and a fourth “spurious” 
gene. To distinguish such cases, we summarize the posterior distribution on collections using a marginal 
probability graph G. Eor a pair {i,j) of alterations, let p{i,j) denote the posterior probability that i and j are 
found in the same set. We compute p{i, j ) using the samples from the MCMC algorithm (See Supplementary 


Section Sl.l i. 






Let G = {V, E) be a complete, undirected weighted graph whose vertices are the alterations and where 
each edge e £ E connects a pair of vertices u, v with weight p{u, v). Connected subgraphs of G with many 
high-weight edges are the most exclusive alteration sets in A. We identify these most exclusive alteration 
sets by first removing all edges with weight below a threshold 6 (See Supplementary Section SI.21. Let G{d) 
be the connected components of size > 2 in the resulting graph. The output of CoMEt is the G{6) alteration 
sets. We choose connected components as the output - as opposed to some other partition of the graph such 
as cliques - in order to be able to identify other topologies such as overlapping pathways (alteration sets), 
where two sets of alterations are connected by a cut node. 


2.6 Statistical significance 

While the score <h(M) measures our surprise of observing exclusivity within each of the sets in M condi¬ 
tional on the observed frequencies of each alteration, there are a large number of possible collections, and 
thus we might observe a high score by chance. We evaluate the statistical significance of fhe collection M 
by comparing fo a null disfribufion of scores obfained on permuted alferafion mafrices A wifh fhe sample 
and alferafion frequencies (sums of rows and columns of A) fixed | [T^[^ . Lef <1>* be fhe minimum score 
obfained over N permufafions. We use fhe collections M satisfying <h(M) < <1>* (fhus each such collection 
has P-value < ;^) fo compufe fhe marginal probabilify graph. 

2.7 Simultaneous analysis of alterations and cancer subtypes 

An imporfanf confounding faclor in idenlifying cancer pafhways de novo by analyzing exclusive alferafions 
is fhaf cerfain alferafions primarily occur in parficular cancer subfypes | [37| . If we analyze a mixed sef of 
samples wifh mulfiple subfypes, fhese subfype-specific alferafions will be mufually exclusive in fhe dafa, 
even if fhey are nol in fhe same biological pafhway. When fhe subfypes are known in advance, one solution 
is fo analyze subfypes separafely; unforfunafely fhis reduces sample numbers, fhus reducing power fo iden¬ 
tify combinations of alferafions fhaf are shared across subfypes. CoMEf addresses fhis problem by adding 
one new “subfype row” fo fhe alferafion mafrix A for each subfype. This subfype row confains an alferafion 
in all samples excluding fhose of fhe given subfype. Thus, fhe sefs of alferafions fhaf are surprisingly exclu¬ 
sive wifh fhese subfype rows are fhe ones primarily altered in fhaf subfype. Nofe fhaf when running CoMEf 
wifh subfype rows, we do nof allow multiple subfypes fo be placed in fhe same sef. Because CoMEf simul- 
faneously analyzes multiple alterations sefs, CoMEf can idenfify exclusive sefs confaining subfype-specific 
alferafions, general alterations, or any combinafion of fhese. 

When analyzing fhe cancer dafasef fhaf included sample subfype classificafions, we perform fwo runs of 
CoMEf. Eirsf we ran CoMEf on fhe alferafion mafrix A. Then we ran CoMEf on fhe alferafion mafrix wifh 
“subfype rows” as we described. We summarize fhe ensemble of slafislically significanl collecfions sampled 
by fhe MCMC algorifhm in fhe fwo CoMEf runs by normalizing and combining fhe sampling frequencies of 
each collecfion across fhe fwo runs, and fhen compufing fhe marginal probabilify graph from on fhe merged 
collection. 


2.8 Comparison to MEMo 


The MEMo algorifhm |18| uses a permufafion fesf fo approximafe fhe probabilify of observing exclusive 
mufafions in a gene sef M wifh confingency fable X. The permufafion fesf works by permuting fhe rows in 
A corresponding fo fhe genes in M, and fhen determining if fhe permufafion has a higher fesf sfafisfic fhan 
M. This is fhen repeafed N times fo obfain an empirical P-value. 

The crucial difference befween MEMo and CoMEf is fhaf MEMo uses fhe coverage T{M) as fhe fesf 
sfafisfic, while CoMEf uses fhe fesf sfafisfic T{X). (Eor ease of exposition, lef r(X) also be defined a 
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the coverage for a contingency table X.) The reasoning behind using the coverage as the test statistic is 
the idea that a gene set with mutually exclusive mutations will also have the highest coverage possible, for 
fixed frequencies of individual mutations. While this is true for pairs of genes (which follows from the 
fact that 2x2 contingency tables have only one degree of freedom), when one examines three or more 
genes, maximizing coverage is not the same as maximizing exclusivity. In fact, we can see that for a 
given contingency table X it is possible to find anofher confingency fable X' wifh fhe same margins (gene 
frequencies) as X, buf fhaf has: 

1. Higher exclusivify {T{X') > T{X)) and lower coverage (r(X') < r(X)), which could resulf in a 
deflated P-value for MEMo. 

2. Lower exclusivify {T{X') < T{X)) buf fhe same coverage (r(X') = r(2f)), which would resulf in 
an inflafed P-value for MEMo0 

See examples of bofh cases in Eigurej^ 


3 Results 

3.1 Visualization of results 

We created a web applicafion for interactive visualization of fhe CoMEfresulfs (http : / /compbio-rese arch. 
cs . brown . edu/comet; See Eigure[S4|). Eor each dafasef, fhe website shows fhe modules in fhe CoMEf 
marginal probabilify graph. Users can change fhe minimum edge weighl parameter 6, which dynamically 
updafes fhe modules. Edges in each module are labeled wifh fhe marginal probabilify. Users can view fhe 
rows of fhe alteration mafrix fhaf correspond fo a given module, and also view, sort, and search fhrough fhe 
collecfions sampled by CoMEf fhaf include alferafions in a given module. 


3.2 Benchmarking and comparison to other methods 


We compared CoMEf on fwo simulated mufafion dafasefs fo fhree ofher published mefhods for finding 
mufually exclusive gene sefs: Dendrix 1161, Mulfi-Dendrix | [2T| and muex p^ . In addition, we performed 
a separafe comparison fo MEMo |[T^ (See defails in Secfion[2^. 


3.2.1 Benchmarking of exclusivity scores for individual gene sets 

We firsf compared fhe exclusivity scores used by CoMEf, Dendrix, and muex for individual gene sefs of size 
k on simulafed dafasefs fhaf represenf key feafures of cancer sequencing dafa. In parficular, each simulafed 
dafasef confains: (1) one implanted pafhway P wifh k = 3 genes fhaf is alfered in a fracfion jp samples 
wifh highly exclusive mufafions; (2) a sef (7 of 5 highly alfered genes whose alferafions are nof necessarily 
exclusive; (3) ofher genes confaining only passenger mufafions fhaf were alfered af rate q. The sef C models 
the highly recurrently altered genes that often appear in real cancer data sets, and can confound methods 
for identifying exclusive mutations. Eurther details of the simulation are in the Appendix (See details in 


Supplementary Section S2.2 1 . 

We computed the average rank of the implanted pathway across 25 simulated datasets with n = 500 
samples, varying the coverage 7 p = 0.1,0.2,..., 1.0. We ran muex with the same parameters used in p^ , 
and ranked gene sets ascending by P-value. We then compared the algorithms by the average rank of 
the implanted pathway. We chose to use the average rank rather than a alternate measure such as the true 
positive or false positive rate because our simulated datasets only include a single true positive (the implanted 


^We have not found a case where T{X') < T{X) and r(2f') > r(X), and conjecture that such a case does not exist. 
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Figure 3: Comparison of CoMEt with other methods on simulated data with n = 500 samples, (a) 

The average rank of the implanted pathway in the results from by CoMEt (blue), Dendrix (red), and muex 
(brown) in 25 simulated datasets as a function of coverage. The number of ties (i.e. gene sets with the same 
weight as the implanted pathway) are shown as error bars, (b) Comparison of CoMEt and Multi-Dendrix in 
identifying an implanted collection containing multiple sets of alterations. Bars indicate average of adjusted 
Rand index between reported and implanted collection across 25 simulated datasets. 


pathway). We note that while we were able to reproduce the muex results from |231 using these parameters 


(See details in Section 3.2.3 1 , it is possible that different parameters would improve the performance of 
muex on this dataset. 

We find that on average, CoMEt ranked the implanted pathway higher than the other methods for each 
coverage (Eigurej^. Eor coverage 7 p > 0.3, CoMEt always ranked the implanted pathway first, while 
Dendrix ranked first only for 7 p > 0.7 and muex for 7 p > 0.9. Even with extremely low coverage of 
7 p = 0.1 and 0.2, CoMEt was over an order of magnitude better than the other approaches. We also 
performed this comparison using both a smaller and larger number of samples (n = 250 and n = 750, 
respectively). We found that CoMEt improved as n increased, as larger n gave the CoMEt probabilistic 
test increased power, while Dendrix did not improve and muex still performed poorly (See Eigure [S5] ). 
We also compared the average runtimes of each weight function across all gene sets in each simulated 
dataset (Eigure [S^, finding that CoMEt (< 4 minutes) and Dendrix (< 1 minute) were much faster than 
muex (often an hour or more). This comparison demonstrates the superiority of the statistical score used 
in CoMEt, which is able to identify a pathway with low coverage alterations, even in datasets with highly 
recurrently mutated genes and many passenger mutations, and also runs in reasonable time. 


3.2.2 Benchmarking identification of collections of gene sets 


CoMEt and Multi-Dendrix |211 are the only available methods that simultaneously find collections con¬ 
taining more than one mutually exclusive set. Thus, we compared these algorithms on simulated datasets 
with overlapping and non-overlapping implanted gene sets We generated simulated data using a procedure 
similar to above with three important differences. Eirst, we implanted a collection P = (Pi, P 2 ,..., P*) 
of t pathways, each with exclusive mutations with total coverage 7 p. Second, all genes in each implanted 
pathway are mutated in the same number of samples. Third, we include m = 20, 000 genes and remove 


11 















those mutated in fewer than 1% of total samples (Figure [ST]). We generated datasets varying t from 2 to 4 
and k from 3 to 5 with coverages 7 p between 0.40 and 0.70 (See Table [S^. We also generated datasets with 
overlapping implanted pathways with t = 3, k from 3 to 5, with 7 p = (0.75, 0.75, 0.60). 

On each dataset, we ran CoMEt using k = A,t = 3, and Multi-Dendrix using its default parameters of 
t ranging from 2 to 4, and k ranging from 3 to 5. We compared the consensus sets output by Multi-Dendrix 
with the modules output by CoMEt, using the adjusted Rand index (ARI) | [^ , to score how well each 
algorithm identified the implanted pathways. The ARI measures the agreement between two partitions, with 
ARI = 1 indicating that two partitions are identical and ARI = -1 indicating that two partitions are maximally 
dissimilar. CoMEt outperformed Multi-Dendrix in 11/12 simulated datasets (each containing 25 replicates) 
(Eigure and Table S31. CoMEt found a much larger fraction of the implanted pathways (difference in 
ARI was > 0.2 for 8/12 datasets). Eurthermore, CoMEt had an ARI > 0.5 for all 12 datasets, and ARI> 0.8 
for 7/12 datasets. We emphasize that we ran CoMEt with a single value of t and a single value of k over all 
datasets even though the size and number of implanted pathways varied across datasets. In contrast, Multi- 
Dendrix was run with a range of parameter values. This demonstrates that CoMEt is much less sensitive to 
parameter choices than Multi-Dendrix. 

We also compared the output of CoMEt and Multi-Dendrix using the true values of t and k. We found 
that CoMEt outperformed Multi-Dendrix on 11/12 datasets (Table [S^. This shows that the statistical score 
used by CoMEt and the MCMC sampling are important features, even on simulated datasets where the 
implanted collections are fairly strong signals in the data. 


muex 



CoMEt 



Alterations set 

4>(M) 

Statistic 

Alteration set 

<f>(M) 

Statistic 

Multi-Dendrix GBM dataset j2l[, k — 3, t — 3 

EGFR ,PDGFRA (A) FTFN(D) 

0.0029 

1.89 

CDK4(A),CDKN2A(D),RB1 

6.3 X 10"^® 

-3.93 

FRMPD4(D),MDM2(A)FIK3CA 

0.011 

2.63 

CDKN2A(D)MDM2{A),TP53 

4.8 X 10"^'^ 

-5.38 

ABPl ARID2{T)),D USP2 7 

0.18232 

1.78 

IDHl FTENFTEN{33) 

1.1 X 10"® 

-0.55 


muex GBM dataset 

CO 

II 

II 



ABCC9FIK3CA JiPFS, TRATl 

0.96 

2.62 

EGFR, GCSAML, IDHl, OTC 

1.9 X 10"® 

-14.72 

CNTNAP2JDH1 ,KEL,SCN9A 

0.11 

6.35 

ABCC9,CDK4,CDKN2B,RPL5 

1.1 X 10"'^ 

-8.01 

CDH18,MMP13,SULT1B1,TR1M51 

0.32 

3.24 

CDK4,CNTNAP2,NF1 ,SCN9A 

5.9 X 10"® 

-2.51 


Table 1 : Collections of alterations reported by muex and CoMEt on the Multi-Dendrix GBM dataset 
PH (with k = 3 and t = 3), and on the muex GBM dataset p3j (with /c = 4 and t = 3). Bolded 
alterations indicate genes in the COSMIC Cancer Census |391. 


3.2.3 Comparison to muex on real data 


We compared CoMEt to muex |231 using two different versions of the TCGA glioblastoma (GBM) dataset: 
(1) the dataset from pT| containing 398 alterations and 261 samples; (2) the dataset from pH , containing 
83 alterations and 236 samples (See Section [S2.1[ ). There are 184 samples in both the Multi-Dendrix GBM 
and muex GBM datasets. Besides the samples, the main difference between these two datasets is that the 
muex dataset is restricted to only 83 significantly recurrent alterations. 

Since the muex score is for single alteration sets, we ran muex iteratively to identify collections of 
alteration sets. That is, we run muex to find the top scoring alteration set, remove those alterations, and 


repeat t — 1 times. We ran muex with the parameters used in 1231, restricting to alteration sets with coverage 
at least 0.3, impurity lower than 0.5, and a significance cutoff of 0.05. On the muex GBM dataset, we ran 


CoMEt and muex with k = A and / = 3 to match the parameters used in |231. On the Multi-Dendrix GBM 


12 

















dataset, we ran CoMEt and muex with k = 3 and t = 3, since muex aborted with an out-of-memory error 
for A: = 4 on this dataset. 

On both GBM datasets, CoMEt identifies collections with much more significant exclusivity. More¬ 
over, more of the genes in the CoMEt collections are known cancer genes (according to the COSMIC 
Cancer Census | |^ ) compared to the genes in the muex collections (Table [T]). On the Multi-Dendrix GBM 
dataset, CoMEt identifies fhree collecfions fhaf overlap fhe Rb {CDK4, CDKN2A, RBI), p53 {TP53, MDM2, 
CDKN2A), and PI(3)K {PTEN, IDHl) signaling pathways. Each of these sets include surprisingly exclusive 
alterations, with 4>(M) ranging from 10“® to 10“^®, and all the alterations are in cancer genes. In contrast, 
muex identifies sefs wifh lower coverage and less surprising exclusivify, wifh <h(M) > 10“^ for each sef, 
and fhree of fhe alferafions are nof in known cancer genes. 

On the muex GBM dataset, CoMEt again identifies more exclusive alferafion sefs thaf overlap more 
known cancer genes, while muex reporfs few known cancer genes wifh mosf having an uncerfain association 
wifh cancer. In general fhis dafasef seems fo include more spurious alferafions, as bofh algorifhms idenfify 
less exclusive sefs wifh fewer cancer genes fhan on fhe Mulfi-Dendrix GBM dafaset. This mighf be a resulf 
of fhe differenf handling of copy number aberrafions in the two papers (see |[^ and |[2^). 


Multi-Dendrix 


CoMEt 



Alterations set 

<f>(M) 

W(M) 

Alteration set 

$(M) 

W(M) 

Pan-cancer GBM dataset j^, fc = 3, t = 4 

CDKN2A(D),CDK4{A),RB1 

1.4 X 

160 

CDKN2AGy),CDK4{A),RBl 

1.4 X 10"^^ 

160 

TP53,MDM2(A), MDM4(A) 

3.7 X 10~® 

128 

TP53,MDM2{A),STAG2 

3.8 X 10“’’ 

119 

PTEN,PIK3CA,IDHl 

3.5 X 10~^ 

125 

PTEN,LRP1B,IDH1 

6.9 X 10“® 

112 

EGFR,NF1 ,PDGFRA(A) 

1.1 X 10“^ 

120 

EGFR,NF1,CALCR 

4.3 X 10-^ 

111 

Pan-cancer GBM dataset 

without MutSigCV filter, fe = 3, t = 

4 


CDKN2A(D),CDK4{A),RB1 

1.4 X 10~^® 

160 

CDKN2AGy),CDK4{A),RBl 

1.4 X 10“^^ 

160 

TP53,MDM2(A)EGFR 

3.6 X 10“^ 

144 

TP53,MDM2{A),STAG2 

3.8 X 10“’’ 

119 

PTEN,MUC16,IDHl 

1.3 X 10'^ 

130 

PTEN,LRP1B,IDH1 

6.9 X 10“® 

112 

TTN, PIK3R1,PDGFRA(A) 

2.5 X 10“^ 

109 

EGFR,NF1,PKHD1 

1.1 X 10-* 

117 


Table 2: Collections of alterations reported by Multi-Dendrix and CoMEt on Pan-Cancer glioblastoma 
data pOj (with k = 3 and t = 4). Bolded alterations indicate differences between alteration datasets with 
and without MutSigCV filter. 


3.2.4 Comparison to Multi-Dendrix on real data 


Because CoMEt conditions on the observed alteration frequencies, we argue that it is less biased towards 
gene that have high mutation frequencies because of their higher background mutation rates; e.g. long genes. 
To illustrate this point, we compare CoMEt with Multi-Dendrix on Glioblastoma (GBM) and Breast cancer 
(BRCA) with and without the MutSigCV Q filter that requires that frequently mutated genes have low 
MutSigCV g-values (See Section [SXT for details). We ran CoMEt and Multi-Dendrix with k = 3 and f = 4 
on GBM and k = A and f = 4 on BRCA. We used mutation data from the TCGA Pan-Cancer dataset | 
which contains whole-exome and copy number array data, and downloaded MutSigCV output from the 
corresponding Synapse repository (syn2812925). We used different TCGA GBM and BRCA datasets here 
than we present in Section [^because of the availability of MutSigCV results on the Pan-Cancer dataset. 
Eor each cancer, we generated two datasets. In one dataset, we applied a MutSigCV filter to remove highly 
altered genes (altered in > 2.5% of samples) but insignificant by MutSigCV (q-value < 0.1). The second 
dataset did not include any MutSigCV filter. 
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We found that CoMEt produces almost identical results when using the alteration dataset with or without 
the MutSigCV filter, in both GBM and BRCA (See Table and Supplementary Table |S1|). In contrast, the 
Multi-Dendrix results are different with and without the MutSigSV filter. Without the MutSigCV filter, 
Multi-Dendrix output includes highly altered genes in GBM (including TTN and MUC16) that are known 
to have high background mutation rates. Furthermore, Multi-Dendrix gives the collection including TTN 
and MUC16 higher weight W' than the highest weight collection obtained with the MutSigCV filter. Multi- 
Dendrix also has quite different results between the datasets with and without the MutSigCV filter in BRCA, 
while CoMEt is largely consistent. This observation demonstrates that genes with high alteration frequencies 
can dominate the mutual exclusivity signal in Dendrix weight function W{M) and bias the algorithms 
towards identifying gene sets where the majority of the coverage comes from one gene, while CoMEt gives 
less weight to these genes, which are likely not cancer genes. 


3.3 CoMEt results on real cancer datasets 

We ran CoMEt on four mutation datasets from TCGA: glioblastoma (GBM) Q, breast cancer (BRCA) Q, 
gastric cancer (STAD) Q and acute myeloid leukemia (AME) Q. Because CoMEt can analyze any type of 
binary alterations, we include many types of alterations in these datasets: small indels and single nucleotide 
variations, copy number aberrations, aberrant splicing events, gene fusions, and (for BRCA and STAD) 
cancer subtype. See Section S2.1 for details on these datasets and Supplementary Section SI.2 for details 
on parameters. 
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Figure 4: CoMEt results on TCGA AML. Each circle represents the alterations in a gene or genomic 
region with the number in the circle indicating the number of samples in which the alteration occurs. Black 
lines are edges in the marginal probability graph with indicated probabilities. Orange polygons indicate the 
sets in the collection M with the most significant value fh(M). <I> values are shown for each top set (orange 
polygon). 
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Acute myeloid leukemia (AML) We first ran CoMEt with t = 4 alteration sets, each of size A: = 4. 
The CoMEt output contains four mutually exclusive modules that include 18 alterations (EigurejS^. These 
4 modules are: (1) TP53, RUNXl, NPMl, PML-RARa (52.5% of samples); (2) KDM6A, FLT3, tyrosine 
kinases, RAS proteins, serine/threonine kinases, DNMT3A, MTJ -X fusions, MYHll-CBpp, and RUNXl- 
RUNXITI fusion (70% of samples); (3) cohesin complex, other myeloid transcription factors, and other 
epigenetic modifiers (33% of samples); (4) TET2 and IDH2 (18.5% of samples). 

The recent TCGA AME publication Q reported strong mutual exclusivity (using an earlier version of 
CoMEt algorithm, called Dendrix++) across several expert-defined classes. Thus, we increased the value of 
k compute t = A gene sets with sizes A: = 6,4,4,3. Because of the larger values of k, we increased the 
number of MCMC iterations to 200 million. The resulting marginal probability graph (6 = 0.179) contained 
4 mutually exclusive modules with a total of 19 genes (Eigurej^. 

The first module contains six perfectly mutually exclusive alterations. These six alterations include: mu¬ 
tations in TP53, RUNXl, NPMP, PML-RARa, MYHll-CBFfl fusion genes, and other MLL fusions, which 
we denote as MLL-X fusions, following Q. These six alterations are known to be drivers in AME, and to¬ 
gether are found in 63.5% of the samples. These fusion genes are defining aberrations for certain subtypes of 
AME, as PML-RARa, MYH11-CBF(3, and MLL-fusions are associated with acute promyelocytic leukemia, 
acute monoblastic or monocytic leukemia, and acute megakaryoblastic leukemia, respectively. The second 
module (altered in 63% of samples) contains receptor tyrosine kinases (RTKs) and their downstream RAS 
target proteins. These include mutations in the FLT3 tyrosine kinase, other tyrosine kinases, serine/threonine 
kinases, and RAS proteins. Two additional genes DNMT3A and KDM6A, are also included in this set. These 
genes are involved in DNA/histone methylation, and their interactions with the other RTK/RAS genes in 
the set are less clear. Notably, the marginal probability graph (Eigurej^ shows that the connection between 
DNMT3A and other genes in the set is largely due to its mutual exclusivity with other tyrosine kinases, and 
in fact a number of samples have mutations in both FLT3 and DMNT3A (Eigure |S10| ). Thus, the patterns 
of exclusivity/co-occurrence between alterations may be subtle, demonstrating the advantages of CoMEt’s 
approach to simultaneously examine multiple collections of sets of alterations. 

The third module (altered in 35.5% of samples) contains genes related to chromatin modification and 
gene regulation including ASXLl, the cohesin complex, other myeloid transcription factors, and other epi¬ 
genetic modifiers. Einally, the fourth module (altered in 24.5% of samples) contains genes related to DNA 
methylation including TET2, IDH2 and protein tyrosine phosphatases. Mutual exclusivity between TET2 


and 1DH2 in AME has been previously reported [41 -431. Moreover, recent work provides a mechanistic 


explanation for this observed exclusivity: Eigeroa et al. |411 show that mutant 1DH12 inhibits TET2’s func¬ 
tion in demethylation of 5-methylcytosine. These results demonstrate that CoMEt is able to extract multiple 
functional modules directly from alteration data. 


Glioblastoma multiforme (GBM) We ran CoMEt on the TCGA GBM dataset from | |2T] | with t = A and 
k = A. While |21| removed amplifications in EGER because they were so frequent it confounded their 
analysis, we added these amplifications back when running CoMEt, treating EGER amplifications and TP53 
as subtypes so they could not be sampled in the same set (See Section [277] for details). The resulting marginal 
probability graph (<5 = 0.263) includes two mutually exclusive modules (Eigurej^a)). 

The first module includes: three genes in the Rb signaling pathway {CDK4, RBI, CDKN2C) and three 
genes in the p53 signaling pathway {TP53, MDM2, and MDM4)), as annotated by the original TCGA GBM 
publication Q- This module also contains deletions in CDKN2A, which is a member of both the Rb sig¬ 
naling and p53 signaling pathways. Indeed, it is well known that different isoforms of the CDKN2A gene 
are involved in the Rb and p53 signaling pathways (See Eigure [^b) and also ||T|) and the genomic deletion 
of CDKN2A affects both isoforms. Moreover, we find that the pairs CDK4-RB1 and MDM2-TP53 have 
surprisingly co-occurring alterations (P = 6 x 10“^^; See Eigure |^b)). This co-occurrence is stronger 
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Figure 5: CoMEt results on (a) TCGA GBM. Style is the same as in Figure except for the addition of 
character(s) inside parentheses after a gene name indicates the type of alterations in the gene: D: deletion, 
A: amplification, (b) Different splice variants of CDKN2A are part of both the Rb signaling (left) and p53 
signaling (right) pathways. CoMEt recovers this relationship as two separate exclusive gene sets. The Rb 
signaling {RBI and CDK4) and p53 signaling {MDM2 and TPS3) gene sets (not including CDKN2A) exhibit 
a statistically significant number of co-occurring mutations (P = 6 x 10“^^), where the co-occurrence 
between the pairs (dotted orange line) is more significant than between any of the pairs of genes (dotted red 
line). 


than the co-occurrence of alterations in individual genes. This pattern indicates that glioblastomas can alter 
the function of the Rb and p53 signaling pathways either by deleting CDKN2A, or by altering one gene in 
each of the pairs {CDK4, RBI) and {TP53, MDM2). We emphasize that CoMEt identified fhis overlapping 
module by sampling non-overlapping exclusive sefs. Einally, fhis module confains alferafions in fhree ad- 
difional genes: NPAS3, VAV2, and MSL3. NPAS3 has been sfudied as a novel lafe-sfage acfing progression 
factor in gliomas wifh fumor suppressive functions [44 451. VAV2 has been reporfed to regulafe EGFR, 
and knockdown of VAV2 enhanced EGER degradafion and furfher reduced cell proliferafion 1461. MSE3 is a 
member of fhe male-specific lefhal (MSE) complex and is fhoughf fo play a role in franscripfional regulafion. 
As reporfed in | [2l] |, fhe MSE complex also includes MOE, which regulafes p53 in cell cycle and may be 
involved in cancer |47|. We nofe fhaf Mulfi-Dendrix identifies similar Rb and p53 signaling modules |211, 
wifh fhe imporfanf difference being fhaf CoMEf correcfly places CDKN2A in a module wifh bofh fhe Rb and 
p53 signaling pafhways, consisfenf wifh fhe figure in fhe TCGA GBM publication Q- 

The second module includes alferafions in fhe PI(3)K signaling pafhway - including PIK3R1, PTEN, 
delefion of PTEN and IDEIl - as well as amplificalions in fhe genes {EGER, PDGERA) and in a region con- 
faining PRDM2 and PDPN. Addifional genes in fhis module are NEl and SRCRB4D. The PI(3)K signaling 
pafhway genes overlap fhe resulfs reporfed by Mulfi-Dendrix on fhis dafasef in |211, wifh fhe imporfanf dif¬ 
ferences being CoMEf includes NEl and amplificalions in EGER (which were nol analyzed by |[2T|). In fhis 
module, we also found one mulually exclusive gene sel (from fhe highesl weigh! colleclion) fhaf includes 
EGER, IDEIl, NEl, and PDGERA. Alferafions in Ihese genes have sfrong associafion wifh individual sub- 
lypes in GBM 1371: EGER amplification is associaled wifh fhe Classical GBM sublype, IDHl and PDGERA 
amplification are associaled wifh Proneural GBM sublype, and NEl is associaled wifh Mesenchymal GBM 
sublype. This shows fhaf mulually exclusive gene sefs can resulf from sublype-specific mulalions. 

Einally, SRCRB4D is a scavenger receplor wifh no known associations wifh cancer. However, fwo ofher 
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scavenger receptor genes have previously reported roles in glioblastoma. Homozygous deletions of DMBTl 
were reported in glioblastomas and astrocytomas |[48|[49|. CDS6 was recently reported to be involved in 


cancer stem cell maintainence in glioblastoma |50|. 

These results show that CoMEt can automatically find large portions of the pathways that were manually 
curated in TCGA GBM publication |[T| , including overlapping pathways. Moreover, CoMEt identifies addi¬ 
tional genes with putative roles in glioblastoma and significant exclusivity with other known glioblastoma 
genes. 
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Eigure 6 : CoMEt results on (a) TCGA STAD subtypes, (b) TCGA BRCA subtypes. Style is the same as 
in Eigure except for the addition of subtype alterations (brown) and that character(s) inside parentheses 
after a gene name indicates the type of alterations in the gene (AS: alternative splicing event, F: fusion 
gene). Here, an edge between a subtype and an alteration indicates that the alteration is associated with the 
subtype. 


Gastric cancer (STAD) We performed two runs of CoMEt on the TCGA gastric cancer (STAD) dataset, 
and then merged the runs (described in Section [T7| ). We first ran CoMEt with f = 4 and k = A. We then ran 
CoMEt on a STAD dataset that included sample subtype classifications. TCGA recently classified gastric 
cancers into four subtypes based on integration of different molecular data Q. To examine the relationships 
between subtypes and other alterations, we introduce “subtype alterations” for the three subtypes from Q 
(we did not include the hypermutated samples from the MSI subtype in our analysis). As described in 


17 































Section 2.7 these “subtype alterations” are marked as altered in samples that are not members of the subtype, 
so that mutual exclusivity between an “subtype alteration” and another alteration indicates that the alteration 
is enriched in the subtype. We ran CoMEt on the STAD dataset with subtype alterations using k = A and 
t = 3 (the number of subtypes). 

CoMEt identified five mutually exclusive modules from the marginal probability graph {6 = 0.132) in 
the STAD dataset (Eigurej^a)). Each of these modules includes known cancer genes and novel candidate 
genes. Two modules indicate subtype-specific altered genes and pathways. The first module (altered in 69% 
(150/217) of the STAD samples) includes two genes, TP53 and PIK3CA, that are enriched for alterations 
in the CIN and EBV subtypes, respectively. TCGA gastric study reported that 80% of EBV tumors contain 
an alteration in PIK3CA, and they suggested that EBV tumors might respond to PI3-kinase inhibitors Q. 
Given this strong signal, it is not surprising that these two genes appear in CoMEt results. However, these 
signals do not dominate the CoMEt results, and four other interesting modules are also output. There are six 
other mutated genes in this first module including MAP2K7, TLNl, BAT2L1, C12orf63 (recently renamed 
CFAP54), MYOM3 and PTPRJ. Given the rarity of these mutations, their significance is unclear. 

The second STAD model includes the genomically stable (GS) subtype, mutations in CDHl, mutations 
in PCDHAll, ARHGAP6-CLDN18 fusions, and amplification of a region containing EPHB3. CDPIl so¬ 
matic mutations and ARHGAP6-CLDN18 fusions were reported to be mutually exclusive and enriched in 
the genomically stable subtype in gastric cancer Q, and CoMEt recapitulates this result. EPHB3 is the 
member of Eph/ephrin signaling which controls the compartmentalization of cells in epithelial tissues. A 
recent study | [5T[ demonstrated EphB receptors (e.g. EPHBl and EPHB3) interact with CDHl in epithelial 
intestinal cells that regulates the formation of E-cadherin-based adhesions. This interaction explains the 
perfect mutual exclusivity between CDHl and EPHB3, which to our knowledge is the first report of this re¬ 
lationship. This demonstrates that mutual exclusivity between pairs of alterations/subtypes may have subtle 
explanations, further underscoring the need for analysis of collections of multiple alterations. 

The third module (altered in 95/217 of samples) includes amplifications of KRAS and ERBB2, and 
mutations in BTBDl 1. KRAS and ERBB2 are members of the RTK/RAS signaling pathway, and their role 
in cancer is well-documented. Eittle is known about the function of BTBDl 1, and thus the significance of 
the mutations is unclear. 

The fourth STAD module (115/217 of samples) contains three altered genes, including amplifications 
of CCNEl, mutations in SMAD4 and splice-site mutations in MET. CCNEl is a well-known cell cycle 
mediator, and SMAD4 is a member of the TGE-/? pathway, and MET participates in the RTK/RAS signaling 
pathway Q. 

The fifth STAD module (79/217 of samples) contains four altered genes, including amplifications in 
a region with 1L7R and TIER, deletions in a region with HDACIO and BRDl, mutations in ARIDIA, and 
mutations in CNBDl. ARIDIA is a well-known cancer gene shown to be significantly mutated in gastric 
cancer 0- Moreover, inhibition of HDACIO has been reported with association with human gastric cancer 
cells |[5^. Gain-of-function mutations in 1L7R have been reported to associated with childhood acute lym¬ 


phoblastic leukemia |531. Our CoMEt results suggest that 1L7R mutations may have a role in gastric cancer 
as well. 


Breast cancer (BRCA) We performed two runs of CoMEt on the TCGA breast cancer (BRCA) dataset, 
and then merged the runs. We first ran CoMEt with k = A and t = A. We then introduced subtype alterations 
for four subtypes from Q (as described in Section |2.7| ). Breast cancers are traditionally classified into 
multiple subtypes based on mRNA expression. Here we analyze four subtypes: luminal A, luminal B, basal- 
like, and HER2-enriched. We ran CoMEt on a BRCA dataset that included sample subtype classifications 
with k = A and t = A (the number of subtypes). 

CoMEt identified three subtype-specific modules and three modules with mutated genes (Eigure [^b)) 
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in the marginal probability graph (6 = 0.287). The first module shows the strong association between 
amplification of CCNDl and the luminal B subtype as previously reported | [54t . Similarly, the third module 
shows the strong association between ERBB2 amplification and the HER2 {ERRB2) -enriched subtype. 

The second module shows a complicated relationship between: (1) subtype-associated alterations in 
the Luminal-A and Basal-Like subtypes, and (2) mutual exclusivity resulting from alterations in the same 
pathway(s). This module contains five sefs of genes (highlighfed in orange in Figure |^b)) in fhe highesf 
scoring collection M oufpuf by CoMEf. Consisfenf wifh TCGA sfudy 0, we find fhaf: CDHl, AKTl and 
PIK3CA are associafed wifh fhe luminal A subfype, and fhese form a sef in fhe CoMEf oufpuf. Similarly, 
TP53 and amplificalion of chromosome region 4ql3.3 are associated wifh fhe basal-like subfype, and also 
form a sef in fhe CoMEf oufpuf. Two of fhe ofher sefs confains genes in fhe same pafhway. PTEN is a known 
inhibitor of PIK3CA, explaining fhe observed exclusivify befween PTEN deletion and PIK3CA mufafion. 
Moreover, MCLl, MAP3K1, AKTl are all pari of fhe PI(3)K/Akl signaling pafhway. Togelher, fhese sefs 
confain five genes lhal are annofaled as pari of fhe PI(3)K/Akl signaling pafhway in TCGA sfudy 0. (red 
circles in Figure 0b)). 

The final sef in Ihis module includes mufafions in fhe genes TP53, CDHl, GATA3 and CTCE. These 
four genes are altered in 54.83% (278/507) of fhe BRCA samples. TP53 is a member of fhe p53 signaling 
pafhway, while CDHl, GATA3, and CTCE all have been reported as polenfial driver genes in breasl cancer. 
CDHl is a lumor supressor lhal is well-known fo play mulliple roles in cancer | |55| , including invasion and 
proliferalion in breasl cancer p6| . GATA3 is a Iranscripfion faclor lhal has long been known lo be involved in 
breasl cancer lumorigenesis | |57| . Recenfly, GATA3 has been reporled lo promofe differenlialion, suppresses 
mefaslasis and alter fhe lumor microenvironmenf in breasl cancer p^. As nofed by Mulli-Dendrix |[2T|, 


GATA3 has also been reporled lo suppress lumor melaslases Ihrough inhibilion of CDHl promolers | [59| , 
which suggesls lhal fhe mufafions in GATA3 are an allernale way lo downregulale CDHl and may explain 
fhe exclusivify of fhe mufafions in GATA3 and CDHl. Moreover, GATA3 is enriched for mufafions in bolh 
luminal A and luminal B, i.e. 32 of fhe 54 mufafions in GATA3 occur in luminal A{P = 0.0207) and 32 of 
fhe 54 mufafions in GATA3 occur in luminal B (P = 0.065). This mighl suggesl GATA3 mufafions mainly 
occur in palienls wifh luminal breasl cancer. CTCE neighbors CDHl on chromosome 16q22.1 and has been 
reporled wifh CDHl to be a lumor suppressor in breasl cancer | [^[M| . Inlereslingly, bolh CDHl and CTCE 
have mosl of fheir mufafions in samples of fhe luminal A subfype. CDHl is enriched for mufafions in luminal 
A (as reported in Q) and 9 of fhe 13 mufafions in CTCE occur in luminal A (P = 0.0891), suggesting fhese 
Iwo genes are in a pafhway specifically largeled in luminal A. Furlhermore, 4 of fhe 9 mufafions in CTCE 
in luminal A are missense mufafions in zinc finger domains, suggesling possible functional role for fhese 
mufafions | [62l |. 

Togelher, fhese resulls demonsfrafe CoMEl’s abilify lo simulfaneously idenlify alferafions lhal are mu- 
lually exclusive due to inleracfions befween genes in palhways or due fo sublype-specific alferafions. This 
allows a more refined inlerprelalion of mulually exclusive alferafions fhan simple pairwise analyses. 


4 Discussion 


We inlroduce fhe CoMEf algorilhm for identifying colleclions of mulually exclusive alferafions in cancer de 
novo, i.e. wifh no prior biological knowledge. CoMEf uses a novel slalislical score for exclusive alferafions 
lhal condifions on fhe frequency of each alleralion and Ihus can defecl exclusivify of rare mufafions. CoMEf 
overcomes large compulalional challenges in compuling fhe score using a new algorilhm for contingency 
fable analysis, and in oplimizing fhe score in genome-scale dafa using fhe firsl Markov chain Monte Carlo 
(MCMC) algorilhm for idenfifying collections of exclusive alferafions. 

We demonsfrafe lhal CoMEf is superior fo earlier de novo mefhods - Dendrix fT^, muex |[23|, and 


Mulfi-Dendrix |211 - on simulaled and real dafa. We Ihen apply CoMEf lo large mufafion dafasels from 
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multiple TCGA cancer types On each dataset, CoMEt identifies significantly exclusive collec¬ 

tions of alterations that overlap well-known cancer pathways, as well as implicate novel cancer genes. In 
addition, CoMEt illustrates subtle relationships between mutual exclusivity resulting from cancer subtypes 
and exclusivity resulting from pathways or protein interactions. These findings provide testable hypotheses 
for further downstream analysis or experimental validation. 

The input to CoMEt is a matrix of binary alterations, and thus can be used to analyze a variety of 
alterations including point mutations and indels, copy number aberrations (amplifications and deletions) 
and complex rearrangements, splice-site mutations, gene fusions, and subtype annotations. CoMEt may be 
useful in analysis of other types of alterations; e.g. germline variants. 

Another application for CoMEt is pan-cancer analysis, such as the recently published TCGA study Q 
and the upcoming ICGC pan-cancer project. Since pan-cancer datasets have many cancer-type specific 
alterations, CoMEt’s ability to simultaneously analyze type-specific and other types of exclusive alterations 
should prove useful for this analysis. Einally, we anticipate that the novel tail enumeration strategy used 
in CoMEt may be of broader interest, both for examining mutual exclusivity in other datasets, including 
non-biological data, as well as for adapting for other types of exact statistics. 
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SI Methods 


Sl.l MCMC Algorithm 

We define a Markov chain whose states Q are possible collections M and where transitions between states 
(collections) are defined such that the chain is ergodic. Finite and ergodic Markov chains converge to a 
unique stationary distribution. In this case, because we want to sample from collections M in proportion to 
their weights our desired stationary distribution is 


ttm 




( 5 ) 


Note that we use so more exclusive collections have higher weights. The Metropolis-Hastings 

algorithm p4|[^ is a method for defining transition probabilities for an irreducible Markov chain such that 
the modified chain is ergodic and has a desired stationary distribution. A Metropolis-Hastings algorithm to 
sample collections M according to this stationary distribution is as follows: 

It is easy to see that this chain is ergodic (it is possible to reach any state (collection) from any other state 
(collection), it is finite, and it is not bipartite) and thus it converges to our desired stationary distribution. 
We apply a parameter a to increase/decrease the difference between d>(M'^) and <F(M 7 v)- Also, in the 
second step of the algorithm, we ensure that the number of exclusive alterations is larger than the number of 
co-occurring by checking that the Dendrix weight W (M) > 0. This is to avoid examining sets alterations 
with high coverage (e.g. altered over 90% of samples) that may have significant exclusivity even though 
relatively few samples harbor exclusive alterations. We assess convergence of the MCMC algorithm by 
calculating total variation distance of the the sampling distributions from multiple chains with different 
initializations (See Supplementary Section Sl.l|). The MCMC algorithm consists of the following steps: 


Initialization. Choose tk genes uniformly at random from £, and assign k genes at random to initialize 


Iteration. For N = 1,2,, obtain Mjv+i from Mtv as follows: 

1. Select a gene g uniformly at random from £. 

2. Define the proposed collection Mjy as follows: 

i) If <7 ^ Mtv, then choose uniformly at random gene g' G Mi, and replace g' with g. 

ii) Else, choose uniformly at random gene g' G Mi, and swap genes g and gk Note that if 
g, g' G Mi, then Mi will be unchanged. 
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3. Let P(MAr, M'^) = min{l 




}• 


4. With probability P(Mjv, M'^), Mat+i = M'^, else Mat+i = Mat- 


Convergence of MCMC from different initial gene sets To assess convergence of the MCMC algorithm, 
we compare the sampling distributions from multiple chains with different initializations. The idea is that 
if multiple chains have converged, by definition they should appear very similar to one another; if not, one 
or more of the chains has failed to converge. We create the following pipeline for performing CoMEt on 
all our experiments. To make the initializations have high variety, we create 5 to 10 initializations for any 


t and k of CoMEt. One initialization is from Multi-Dendrix | [2T| results with the same t and k, and other 
initializations generate randomly. 

Precisely, we first define whole sampling gene sefs Q as fhe union of fhe lasf 0.5% of sampling gene 
sefs uji of each chain i. We fhen define fhe disfribufion for each chain i over fhe whole sampling gene sefs 
O = U\/iiOi as Pi{s) = Fi{s)/\uji\ if s G O ofhereise 0, and fhe disfribufion for fhe union chain u over fhe 
whole sampling gene sefs O as Pu{s) = F„(s)/|r2|. We calculafe fhe mean fofal variation disfance over Pi 
and Pu- The fofal variation disfance befween Pi and Pu is defined as 


\Pi - PuWtv = max||Pi(s) - Puis) 


( 6 ) 


We sfarf CoMEf wifh 100 million iferafions for each of fhese inifializafions. We examine convergence by 
calculafing bofh mefrics across fhe inifializafions. If fhe mefric is close fo 0, fhis indicafes fhe convergence 
and fhe process will be slopped; olherwise, we increase 1.5 limes of fhe number of iterations until the number 
of iterations touches 1 billion. In final, we use fhe union of fen sampling disfribufion as sampling resulls. 
Eor example, we performed above pipeline for f = 3 and /c = 4 on AME mufafion dala and plofled fhe 
disfribufion of fofal varialion disfance after 1 million iteration and 10 million iteration MCMC runs (Eigure 

m. 


S1.2 Parameter selection 

We selecl 6 wifh fhe following heurisfic procedure. When we run CoMEl wifh t sefs in fhe collection, ideally 
we should obfain t cliques in fhe marginal probabilily graph. To find fhe besl 6 lhal fulfills fhe expecfafion, 
we search for an “E-comer” in a graph of fhe number of edges in fhe marginal probabilily graph as a function 
of fhe edge weighl. 

More precisely, we first plot a log-log distribution with the number of edges in the marginal probability 
graph with edge weight > p against edge weight p (Eigure [S^. We choose 6 starting from the minimum 
edge weight Pmin that contains at least t edges in the marginal probability graph, e.g. the yellow 
horizontal line in Eigure]^ shows the number of edges in GBM with k = 3 and t = 3. We identify a value 
6 where the number of edges increases dramatically after this value as the probability threshold decreases. 
To find fhis value, for each value x we perform a linear regression of fwo besf-fif lines (using root mean 
squared error) before and after this value. We the first p > pmin that forms a “E-corner”, i.e. the slope of the 
two best-fit lines changes from a smaller negative value to a larger negative value as the value x decreases 
(e.g. moving leftward in EigurejS^. 

Eor each cancer dataset, we ran CoMEt with /c = 4, f = 4, and 100 million iterations using 5 to 10 
random initializations. Eor BRCA and STAD with subtypes, we ran CoMEt with /c = 4 and t equal to 
the number of pre-defined subtypes (4 and 3, respectively), and 100 million iterations using 10 random 
initializations. (See Section S2.1 and Supplementary Section Sl.l| ). Ideally, CoMEt should be run with the 
largest values of k and t that are biologically meaningful for a particular dataset. If smaller values of k 
and t are best supported by the data, the summarization procedure will demonstrate this. In practice, using 
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large values of k and t might lead to long run times and slow convergence of the MCMC algorithm, since 
the space of possible collections will be very large. Thus, an alternative approach that we use to generate 
results is to run with small values of t and k (e.g. f = 3, 4 and /c = 3,4) and examine the resulting marginal 
probability graph. If there t or more cliques or approximate cliques in the graph, this suggests the use of 
larger values of t and k. We used this approach to find larger collections in the AML dataset (See details in 
Section [33] ). 


S2 Data 

S2.1 Somatic mutation datasets 

Acute myeloid leukemia (AML) The AML dataset contains whole-exome and copy number array data 
in 200 AML patients from The Cancer Genome Atlas (TCGA) Q. Using the annotations in Q, we 
categorized multiple genes together based on expert knowledge, which results in 9 categories including 
spliceosome, cohesin complex, MLL-X fusions, other myeloid transcription factors, other epigenetic mod¬ 
ifiers, ofher fyrosine kinase, serine/fhreonine kinase, profein fyrosine phosphafase, and RAS profein. More 
defails are in Q. This resulfs in 51 genes and 200 pafienfs. 


Glioblastoma multiforme (GBM) We analyzed fhree GBM dafasefs: 


1. The Mulfi-Dendrix GBM dafasef from . This dafasef confains whole-exome and copy number 
array dafa in 261 GBM pafienfs and 398 genes from The Cancer Genome Aflas (TCGA) Q. Dafa 


preparafion for GBM can be found in |211. Nofe fhaf in Secfion 3.3 we included amplificalions in 


EGFR which were nol considered in [211. 


2. The muex GBM dafasef from |23|. This dafasef confains 83 alferafions in 236 samples from ||T|, 
including SNVs in genes identified as significanfly mufafed by MufSigCV 0 and CNAs called by 
GISTIC2 1631 fhen resfricfed fo fhose wifh significanfly concordanf gene expression (higher for am¬ 
plifications, lower for deletions). 


3. The Pan-cancer GBM dafasef from Q. We analyzed fhe non-silenf mufafions (single nucleotide vari- 
anfs and small indels) from fhe MAP file and focal copy number aberrafions from GISTIC2 oufpuf. 
This dafasef confains 509 genes in 291 samples. Moreover, we removed genes wifh non-silenf mu¬ 
fafions in < 1% of samples and wifh mufafions in > 2.5% of samples wifh MufSigCV 0 ^-value 
>0.1. This dafasef confains 406 genes in 291 samples. 


Gastric cancer (STAD) We analyzed fhe non-silenf mufafions (single nucleofide varianfs and small in¬ 
dels) from fhe MAP file in 289 gasfric cancer samples. We also included focal driver copy number aberra¬ 
fions from GISTIC2 oufpuf via Pirehose, fusion genes, rearrangemenfs and splicing events 0. We removed 
74 hypermutators and genes with non-silent mutations in < 2.5% of samples and with mutations in > 3% 
of samples with MutSigCV 0 i^-value > 0.25. This process results in 217 STAD patients and 397 genes 
with mutations. We considered four subtypes identified by TCGA Q, including tumors were positive for 
Epstein-Barr virus (EBV), tumors had high microsatellite instability (MSI) genomically stable (GS) tumors 
with a low level of somatic copy number aberrations, and chromosomally unstable (CIN) tumors with a 
high level of somatic copy number aberrations and were called. We do not analyze the MSI subtype since 
samples in MSI are hypermutators. 
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Breast cancer (BRCA) The BRCA dataset contains whole-exome and copy number array data in 507 
BRCA patients and 375 genes from The Cancer Genome Atlas (TCGA) Q. Data preparation for BRCA 
can be found in 0 . We downloaded subtype information of BRCA from The Cancer Genome Atlas 
(TCGA) 0 . We considered four subtypes - basal-like, HER2-enriched, luminal A, and luminal B - that 
each contain at least 10 % of the total samples. 

S2.2 Simulated data 

We generated simulated datasets using the following approach. Recall C is a set of highly altered genes 
whose alterations are not necessarily exclusive. 

1. Select k genes to form an “implanted pathway” P. 

2. Let 7 p be the fraction of mutated samples in P. Select 'yp x n samples to be exclusively mutated in 
P, where the proportion of mutations in each gene in P is given by the tuple /rp = (ci,..., c^). 

3. Randomly select samples to be mutated in each gene in C, where the fraction of mutated samples per 
gene is given by 7 c. 

4. For each of the n samples s in each of the m genes g (including the implanted and cancer genes), 
mutate 5 in s with fixed probability q. This step introduces noise into the dataset. 

We used m = 100, n = 500, k = 3, up = (0.5,0.35,0.15), |C| = 5, 7 c = (0.67,0.49,0.29,0.29,0.2), 
and q = 0.00275384620 We removed alterations that occurred in fewer than 5 alterations (resulting in the 
average number of genes of 276.44). We ran CoMEt 100 million iterations from 3 random initial starts. 

S3 Supplementary Tables 


Multi-Dendrix 


CoMEt 


Alterations set 4>(M) W(M) Alteration set 4>(M) W(M) 

Pan-cancer BRCA dataset k — 4,t = 4 

PIK3CA,MCL1(A),ZNF703(A),AKT1 9.2x10"® 403 PIK3CA,ING5(D),PEG3,AKT1 2.6x10"^® 352 

TP53,GATA3,MAP3K1,CDH1 1.7x10"“ 397 TP53,GATA3,CDH1,CTCF 4.1x10"“ 380 

CCNDI{A),MYC{A),MAP2K4,CBFB 5.4x10"^ 301 MT-ND1,MYC{A),LAMA2,MAP3K1 1.7x10"'^ 269 

TUBDl{A),STKll,TCF3m,MLL3,RUNXl 2.2 xl0~^ 2%4 CBFB,FHIT{T>),RUNX1,ZNF703{A) 5.5x10"® 256 

Pan-cancer BRCA dataset without MutSigCV filter, k — 4,t — 4 

PIK3CA,TUBD1(A),PTEN(P),AKT1 2.0x10"'^ 397 PIK3CA,ING5(D),PEG3,AKT1 2.6x10"“ 352 

TP53, GATA3, CDHl, MAP2K4 5.0 x 10"“ 382 TP53, GATA3, CDHl, CTCF 4.1 x 10"“ 380 

CCNDI{A),MYC{A),MUC4,MAP3K1 2.7x10"® 323 MT-ND1,MYC(A),LAMA2,MAP3K1 1.7x10"'® 269 

MCL1(A),ZNF703(A),ENSG1*,CROCCP2 1.2x10"® 294 £/fBB2(A), Pr£iV(D), CROCCP2, £iVSG2* 4.4 x 10"® 249 


Table SI: Collections of alterations reported by Multi-Dendrix and CoMEt on TCGA Pan-cancer 
breast cancer data | [40| with k = 3 and t = 4. Bolded alterations indicate differences between al¬ 
teration datasets with and without MutSigCV filter. *Due to the limited width of the page, we replaced 
ENSG00000210082 as ENSGl and ENSG00000211459 as ENSG2. 


®We chose values for C and q using values calculated from real data. We choose C to match the mutation frequencies of the 
five most mutated genes in the TCGA glioblastoma dataset. We calculated q empirically from the TCGA breast cancer mutation 
matrix. 
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t=2 t=3 t=4 

Coverage (0.70, 0.50) (0.70, 0.60, 0.50) (0.70, 0.60, 0.50, 0.40) 


Table S2: Coverages 7 p for eaeh simulated dataset with non-overlapping pathways. 


t 

k 

Class 

Consensus 

CoMEt 

True Highest weight 

Multi-Dendrix 

Consensus True 

2 

3 

Non-overlapping 

0.53 

1.0 


0.65 

0.43 

0.49 

2 

4 

Non-overlapping 

0.51 

1.0 


0.79 

0.53 

0.48 

2 

5 

Non-overlapping 

0.74 

1.0 


0.76 

0.45 

0.48 

3 

3 

Non-overlapping 

0.68 

0.97 


0.85 

0.45 

0.65 



Overlapping 

0.66 

0.94 


0.78 

0.48 

0.69 

3 

4 

Non-overlapping 

0.88 

1.0 


1.0 

0.51 

0.65 



Overlapping 

0.82 

0.95 


0.95 

0.69 

0.68 

3 

5 

Non-overlapping 

0.94 

1.0 


0.88 

0.57 

0.64 



Overlapping 

0.97 

0.96 


0.91 

0.58 

0.6 

4 

3 

Non-overlapping 

0.8 

0.97 


0.73 

0.37 

0.5 

4 

4 

Non-overlapping 

0.92 

1.0 


0.85 

0.47 

0.54 

4 

5 

Non-overlapping 

0.83 

1.0 


0.73 

0.47 

0.5 


Table S3: Results of CoMEt and Multi-Dendrix on simulated datasets consisting of t implanted path¬ 
ways, each with k genes. The mean average adjusted Rand index aeross 25 simulated datasets for different 
parameter ehoiees. We eompared CoMEt to Multi-Dendrix using eaeh algorithm’s eonsensus proeedure 
(“Consensus”). We also eompared CoMEt to Multi-Dendrix when run with the true values of t and k of the 
implanted pathways (“True“). 
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Supplementary Figures 




Figure S1: Plots of total variation distance distribution in each iteration of for an MCMC run with 
IM iterations (left) and an MCMC run with lOM iterations (right). 


a. 2^ contingency table Table X 


T(X) = 57 
r(X) = 75 
4)(X) = 0.663 
MEMO = 0.718 


Not mutated 


Gene 1 


Gene 3 


Mutated 


Gene 1 


Not mutated ; Mutated 


Not mutated i Mutated 


c Not mutated • 50 i 42 

a .I.^. 

Mutated ! 7 ! 12 


c Not mutated i 8 • 3 

O .. \ . 

Mutated i 2 • 1 


b. Less exclusive, same coverage 

T(X') = 56 
r(X') = 75 
4>(X) = 0.734 
MEMO = 0.718 


Not mutated 


Not mutated 
Mutated 


Gene 1 
Not mutated 


Table X' 

Gene 3 


Not mutated 


Gene 1 
Not mutated 


The frequency/coverage 
of each gene is the 
same in each table: 
r(gt) = 58 

r(g2) = 22 

r(g3) = i4 


c. More exclusive, lower coverage Table X” 

T(X") = 58 Mol mutated Gene 3 Mutated 

r(X’') = 70 I [ 

cj3{X) = 0.587 I I 

MEMO = 0.995 '' Gene 1 


Not mutated 
Mutated 


Not mutated 


Mutated 


Not mutated j 2 


Mutated j 0 


0 


12 


Figure S2: Two cases where the MEMo permutation test statistic F (the coverage, or number of al¬ 
tered samples) deflates or inflates the P- value compared to the CoMEt test statistic T (the number of 
samples with exclusive mutations). 
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Figure S3: Scatter plot between negative log of exact and binomial P-values for all sets of A; = 3 
alterations on the GBM dataset (left) and BRCA dataset (right). The color of each dot represents the 
number of co-occurring alterations according to the scale at the right. Note that the P-values for the exact 
test much smaller than the binomial only in cases with relatively low number of co-occurrences. These cases 
are the fastest to compute with the tail enumeration algorithm. 


CoMEt Results 


Significance Plot (N=100) 
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Marginal Probability Graph 

Mutation Matrix 



^PDPN.PRDM2(A] PIK3R1 


PTEN(76)||||||||||||||||||||||||||||i||||||||| 

PTEN(D) (41) .. 

PIK3CA(17) I llllllll 

PPK3R1 (17) II lllllll 

IDH1 (14) I mill 

DPN.PRDM2(A)(13) ' ' 

Coverage: 62.84% (164/261) 

I Show sampled collections of gene sets ] 




Hide sampled collections of gene s 



1170 

1013 


67.2S3 NLFtP3. TPS3 0.0009033116 

67.067 MOM2iA). STAC2, TP63 0.00106643 

67.IK05 C0L6A3. FRMPD4tD). 0.00110612 

MDM4(A) 


1DH1. PTEN. PTEN(D) 
IDH1. PTEN. PTEN(D) 
IDH1.PTEN. PTEN(D) 


1.0eS27»-OB 

i.oes27»-oe 

1.0eS27s-0B 


CDMW. COKN2A4CI). RBI 6.32422»-16 

OOMW. COlWeAO. RBI 6.32422e-19 

CDK4W. CDKN2AO. RBI 6.32422e-ig 


Figure S4: Screenshot of the web application for interactive visualization of CoMEt results. 
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Figure S5: The average rank of the implanted pathway in output from CoMEt (blue), Dendrix (red), 
and muex (green) in 25 simulated datasets using (a) n = 250 and (h) n = 750 samples. The average 
number of ties (i.e. gene sets with the same score as the implanted pathway) are shown as error bars. 


u 

S 

c 

3 

CeS 

eb 



Fraction yp of mutated samples 


Figure S6: Comparison of the average runtime of CoMEt, Dendrix, and muex on = 25 simulated 
datasets with a single implanted pathway. The standard deviation for the runtime is shown as an error bar. 
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Figure S7: The distribution of the number of genes with > x mutations in simulated data. We removed 
those genes mutated in fewer than 1% of mutations, i.e. genes mutated in fewer than 5 samples. 



Figure S8: The distribution of the number of edges with weight > p in GBM with k = 3 and f = 3 in 
log-log scale. The red dot indicates the first hitting edge weight where the change in slope is negative (when 
moving leftward) such that the number of edges in the subgraph is at least f x ( 2 ) = 9 (as the horizontal 
yellow line). 
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Figure S9: CoMEt results on AML dataset with t = 4 and fc = 4, 
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Figure SIO: Mutation matrices for the CoMEt results on (a) TCGA GBM, (b) TCGA AML, (c) TCGA 
STAD, and (d) TCGA BRCA datasets. The matrices have alterations as rows, and samples as columns. 
Each cell indicates whether or not an alteration occurred in a particular sample, where grey indicates the 
sample was not altered. Samples with co-occurring alterations in the same set are colored orange, while 
exclusive alterations are colored blue. 
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